We discuss the problems of applying Maximum Likelihood methods to the CMB and how
one can make them both efficient and optimal. The solution is a generalised eigenvalue problem
that allows virtually no loss of information about the parameter being estimated, but can
allow a substantial compression of the data set. We discuss the more difficult question of
simultaneous estimation of many parameters, and propose solutions. A much fuller account
of most of this work is available (Tegmark et al. 1997, hereafter TTH).



1 Likelihood Analysis 



The standard method for extracting cosmological parameters from the CMB is through the use of Maximum
Likelihood methods. In general the likelihood function, $\mathcal{L}$, for a set of parameters, $\theta$, is given by
a hypothesis, $H_\theta$, for the distribution function of the data set. In the case of a uniform prior, and
assuming a multivariate Gaussian distributed data set consistent with Inflationary models, the a posteriori
probability distribution for the parameters is

$\mathcal{L}(\theta|x, H_\theta) = (2\pi)^{-n/2} |C(\theta)|^{-1/2} \exp\left[-\frac{1}{2} x^\dagger C^{-1}(\theta) x\right]$ (1)

where $\theta = (Q, h, \Omega_0, \Omega_\Lambda, \Omega_b, \ldots)$ are the usual cosmological parameters we would like to determine.
Examples of data are $x = \Delta T$ or $a_{\ell m}$, and the statistics of the $n$ data are fully parametrised by the data
covariance matrix, $C(\theta) = \langle x x^\dagger \rangle$. For simplicity here we assume the data have zero means.
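
As a minimal sketch of how the zero-mean Gaussian likelihood described above might be evaluated numerically, the following uses a Cholesky factorisation to obtain both the determinant and the quadratic form. It assumes real-valued data and a positive-definite covariance; the function name `gaussian_log_likelihood` and the toy inputs are illustrative and not taken from TTH.

```python
import numpy as np

def gaussian_log_likelihood(x, C):
    """ln L for zero-mean Gaussian data x with covariance matrix C."""
    n = x.size
    # Cholesky factor C = L L^T gives both the determinant and the solve.
    L = np.linalg.cholesky(C)
    y = np.linalg.solve(L, x)                    # y = L^{-1} x, so y.y = x^T C^{-1} x
    log_det = 2.0 * np.sum(np.log(np.diag(L)))   # ln|C|
    return -0.5 * (n * np.log(2.0 * np.pi) + log_det + y @ y)

# Toy check (hypothetical data): with unit covariance,
# ln L = -(n/2) ln(2 pi) - |x|^2 / 2.
x_demo = np.array([1.0, -2.0, 0.5])
lnL_demo = gaussian_log_likelihood(x_demo, np.eye(3))
```

Working with the logarithm avoids numerical underflow when $n$ is large, and the factorisation is the $n^3$ step that dominates the cost of each evaluation.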



2 Problems with the likelihood method 



Two important questions we would like to settle about likelihood analysis are (a) is the method optimal
in the sense that we get the minimum variance (smallest error bars) for a given amount of data? and (b)
is the method efficient, i.e. can we realistically find the best-fitting parameters? As an example of this last
point, if we have $n$ data points (pixels, harmonic coefficients, etc.), and $m$ parameters to estimate with a
sampling rate of $1/q$ (that is, $q$ grid points per parameter), we find that the calculation time scales as

$\tau \propto q^m \times n^3$ (2)

where the first term is just the total number of points at which we need to calculate the likelihood, and
the second term is the time that it takes to calculate the inverse of $C$ and its determinant. Of course, in
practice one would not find the maximum likelihood solution this way, but it serves to illustrate the point.
Note that the covariance matrix depends on the parameters and therefore must be evaluated locally in
parameter space. For MAP or Planck we have $m \sim 11$, $q \sim 10$ and $n \sim 10^7$, resulting in $\tau \sim H_0^{-1}$, even for
nanosecond technology. But before we give up in dismay, it is worth looking a bit further at the theory
of parameter estimation.
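
The scaling above can be checked with a back-of-envelope calculation. This sketch takes the scaling literally, assumes one elementary operation per nanosecond, and ignores all prefactors; the Hubble time is taken as a round figure of about 14 Gyr. The function name and constants are illustrative assumptions.

```python
def brute_force_time(m, q, n, seconds_per_op=1e-9):
    """Rough grid-search cost: q**m likelihood evaluations of n**3 operations each."""
    return (q ** m) * (n ** 3) * seconds_per_op

# Roughly 14 Gyr expressed in seconds (assumed round figure).
HUBBLE_TIME_S = 4.4e17

# Values quoted in the text for MAP or Planck.
tau = brute_force_time(m=11, q=10, n=10**7)
ratio = tau / HUBBLE_TIME_S
```

With these inputs the naive count is not below the Hubble time, so a brute-force grid search over the full data set is out of reach whatever the prefactors are.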



3 Parameter information and the solution to our problems 

Suppose we have found the maximum likelihood solutions for each parameter, $\theta = \theta_0$; then the likelihood
function can be approximated by another multivariate Gaussian about this point:

$\mathcal{L}(\theta|\theta_0, H_\theta) = (2\pi)^{-m/2} |F|^{1/2} \exp\left[-\frac{1}{2} \delta\theta^\dagger F \delta\theta\right]$ (3)

where $\delta\theta = \theta - \theta_0$ is the distance to the maximum in parameter space and the parameter covariance
matrix is given by the inverse of $F$, the Fisher Information matrix:

$F_{ij} = \langle \delta\theta_i \delta\theta_j \rangle^{-1} = \frac{1}{2} \mathrm{Tr}[A_i A_j]$, (4)

(if the means of the data are dependent on the parameters, this is modified - see TTH). The far
right-hand-side expression can be calculated for Gaussian distributed data sets (i.e. equation 1), where
$A_i = \partial \ln C(\theta)/\partial\theta_i$ is the slope of the log of the data covariance matrix in parameter space.
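
A minimal sketch of the trace expression for the Fisher matrix follows. It assumes that the logarithmic derivative of the covariance matrix can be evaluated as $C^{-1} \partial C/\partial\theta_i$, and that the derivative matrices are supplied by the user; the function name and the toy signal-plus-noise model are hypothetical.

```python
import numpy as np

def fisher_matrix(C, dC_list):
    """F_ij = 0.5 * Tr[A_i A_j], taking A_i = C^{-1} dC/dtheta_i (zero-mean data)."""
    A = [np.linalg.solve(C, dC) for dC in dC_list]
    m = len(A)
    F = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            F[i, j] = 0.5 * np.trace(A[i] @ A[j])
    return F

# Toy model (hypothetical): C = amp * S + N with a single amplitude parameter,
# so that dC/d(amp) = S.
S = np.diag([4.0, 1.0, 0.25])
N = np.eye(3)
amp = 1.0
F_demo = fisher_matrix(amp * S + N, [S])
```

For this diagonal toy model each datum contributes $\frac{1}{2}[s/(s+1)]^2$ to the Fisher information, so high signal-to-noise data dominate.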

By considering the Fisher matrix as the information content contained in the data set about each 
parameter, we see that the solution to our problem is to reduce the data set without changing the 
parameter information content. Hence to solve the problem of efficiency, we need to make a linear 
transformation of the data set 

x' = Bx, (5) 

where $B$ is an $n' \times n$ matrix with $n' \le n$, and so $x'$ may be a smaller data set than $x$. If $n' < n$ the
transformation is not invertible and some information about the data has been lost. To ensure that the
lost information does not affect the parameter estimation (requirement (a)), we also require

$\partial F'/\partial B = 0$

where $F'$ is the transformed Fisher matrix, i.e. equation (4) evaluated with the transformed covariance
matrix $BCB^\dagger$. In order to avoid learning the unhelpful fact that
no data is an optimal solution, we add in the constraint that data exists. Since we have the freedom to
transform the data covariance matrix, we add the constraint $\lambda(BCB^\dagger - I)$, where $I$ is the unit matrix
and $\lambda$ is a Lagrangian multiplier.



It can be shown (TTH) that this is equivalent to a generalised Karhunen-Loève eigenvalue problem,

$(\partial_{\theta_i} C)\, b = \lambda\, C\, b$, (6)

which has a unique solution $B$, with the eigenvectors $b$ as its rows, for each parameter. These solutions have the property that

$B (\partial_{\theta_i} C) B^\dagger = \lambda_i I$, (7)

where $\lambda_j = 1/\sigma_j$ are the eigenvalues of the transformed data set and the inverse errors associated with
each eigenmode of the new data set.
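
The generalised eigenvalue problem can be sketched numerically by whitening with a Cholesky factor of the covariance matrix and then solving an ordinary symmetric eigenproblem. This is an illustrative route under the assumption of real, positive-definite $C$ and symmetric derivative matrix, not necessarily the implementation used in TTH; the function name `kl_modes` and the random toy covariance are hypothetical.

```python
import numpy as np

def kl_modes(C, dC):
    """Solve dC b = lambda C b; rows of B are modes normalised so B C B^T = I."""
    L = np.linalg.cholesky(C)             # C = L L^T
    Linv = np.linalg.inv(L)
    M = Linv @ dC @ Linv.T                # whitened derivative matrix
    M = 0.5 * (M + M.T)                   # guard against round-off asymmetry
    lam, V = np.linalg.eigh(M)
    order = np.argsort(-np.abs(lam))      # largest |eigenvalue| first
    lam, V = lam[order], V[:, order]
    B = V.T @ Linv                        # row k is b_k^T = v_k^T L^{-1}
    return lam, B

# Toy model (hypothetical): signal plus unit white noise, with the parameter
# being the signal amplitude, so that dC = S.
rng = np.random.default_rng(0)
G = rng.normal(size=(6, 6))
S = G @ G.T
C = S + np.eye(6)
lam, B = kl_modes(C, S)

F_full = 0.5 * np.sum(lam ** 2)       # all six modes
F_kept = 0.5 * np.sum(lam[:3] ** 2)   # best three modes only
```

In this sketch the transformed derivative matrix is diagonal with the eigenvalues on the diagonal, and the single-parameter Fisher information of equation (4) becomes half the sum of the squared eigenvalues, so dropping modes with small eigenvalues loses little information.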

The new, compressed data set, $x'$, can now be ordered by decreasing eigenvalue, so that the first
eigenmode contains the most information about the desired parameter, the second slightly less information,
and so on. The total error on the parameter is then simply given by the inverse of the $1 \times 1$ Fisher
matrix.

Figure 1: The 3 heavy lines show the error bars on 3 CMB parameters as a function of the number of modes
used. Each set of modes has been optimised for the parameter in question. Note that approximately 400
modes are all that is required to get virtually all the information from the entire 4016 cut COBE dataset.
The thin lines show the conditional errors from the SVD procedure outlined in section 4: virtually all
the (conditional) information on all 3 parameters is obtained from the best 500 SVD modes.

We are now free to choose how many eigenmodes to include in the likelihood analysis. A compression
by a factor of 10 will lead to a time saving of a factor of $10^3$. However, this is only exact if we know the true value of the
parameters used to calculate $B$. But if we are near the maximum likelihood solution then we can iterate
towards the exact solution.

This procedure is optimal for all parameters - linear and nonlinear - in the model. In the special 
case of linear parameters that are just proportional to the signal part of the data covariance matrix 
(for example the amplitude of $C_\ell$, if the data are the $a_{\ell m}$), the eigenmodes reduce to signal-to-noise
eigenmodes (Bond 1994; Bunn & Sugiyama 1995). Hence our eigenmodes are more general than signal-to-noise
eigenmodes. Furthermore, as our eigenmodes satisfy the condition that the Fisher matrix is a
maximum, they are the optimal ones for data compression. Any other choice, including signal-to-noise
eigenmodes, would give a higher variance.

In Figure 1 we plot the uncertainty on 3 parameters for COBE-type data: the quadrupole, $Q$, the
spectral index of scalar perturbations, $n$, and the re-ionization optical depth, $\tau$.



4 Estimating many parameters at once 



The analysis presented so far is strictly optimal only for the conditional likelihood - the estimation of one
parameter when all others are known. A far more challenging task is to optimise the data compression
when all parameters are to be estimated from the data. In this case, the marginal error on a single
parameter $\theta_i$ rises above the conditional error $1/\sqrt{F_{ii}}$ to $\sqrt{(F^{-1})_{ii}}$. As far as we are aware, there is no general
solution known to this problem, but here we present some methods which have intuitive motivation and
appear successful in practice.
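
The difference between conditional and marginal errors can be illustrated with a short sketch. The 2x2 Fisher matrix below is a hypothetical example with a strong correlation between the two parameters; it does not come from any CMB calculation.

```python
import numpy as np

def conditional_and_marginal_errors(F):
    """Return (1/sqrt(F_ii), sqrt((F^-1)_ii)) for every parameter."""
    conditional = 1.0 / np.sqrt(np.diag(F))
    marginal = np.sqrt(np.diag(np.linalg.inv(F)))
    return conditional, marginal

# Hypothetical Fisher matrix with unit diagonal and off-diagonal element rho.
rho = 0.99
F_demo = np.array([[1.0, rho], [rho, 1.0]])
cond, marg = conditional_and_marginal_errors(F_demo)
```

For this matrix the conditional errors are unity while the marginal errors are $1/\sqrt{1-\rho^2}$, about seven times larger, which is the situation a good multi-parameter compression has to handle.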
Suppose that we repeat the optimisation procedure, outlined above, $m$ times, once for each parameter.
The union of these sets should do well at estimating all parameters, but the size may be large. However,
many of the modes may contain similar information, and this dataset may be trimmed further without
significant loss of information. This is effected by a singular value decomposition of the union of the
modes, and modes corresponding to small singular values are excluded. Full details are given in TTH
and an example from COBE is illustrated in Figure 1, which shows that for the conditional likelihoods
at least, the data compression procedure can work extremely well. However, this on its own may not
be sufficient to achieve small marginal errors, especially if two or more parameters are highly correlated.
This is expected to be the case for high-resolution CMB experiments such as MAP and Planck (e.g. for
parameters $Q_{\rm rms}$ and $n$). To give a more concrete example, a thin ridge of likelihood at 45° to two
parameter axes has small conditional errors, but the marginal errors can be very large. This applies
whether or not the likelihood surface can be approximated well by a bivariate Gaussian.
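
The trimming step can be sketched as a plain singular value decomposition of the stacked mode matrices. This is a simplified illustration; TTH should be consulted for the actual procedure, including any weighting of the modes. The function name, the relative threshold `rel_tol` and the toy mode sets are assumptions made here.

```python
import numpy as np

def trim_modes_svd(mode_sets, rel_tol=1e-3):
    """Stack mode matrices (rows = modes) and keep directions with large singular values."""
    union = np.vstack(mode_sets)
    _, s, Vt = np.linalg.svd(union, full_matrices=False)
    keep = s > rel_tol * s[0]          # discard near-degenerate directions
    return Vt[keep], s

# Two hypothetical mode sets, each optimised for a different parameter,
# which happen to share one mode.
B1 = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0]])
B2 = np.array([[0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0]])
B_trim, singular_values = trim_modes_svd([B1, B2])
```

Here the four stacked modes span only three independent directions, so one is removed; the retained rows form an orthonormal basis for the space spanned by the union.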

Figure 2: Illustration of data compression with different algorithms. Top left: 'Full' dataset of 508 modes
(for details of parameters etc., see text). Top right: Best 320 modes optimised for measuring $\beta$. Bottom
left: Best 320 modes from SVD application to modes optimised for $\beta$ and $A$. Bottom right: Best 320
modes for optimising along the likelihood ridge axis. Likelihood contours are separated by 0.5 in natural
log.



This latter case motivates an alternative strategy, which recognises that the marginal error is dominated
not by the curvature of the likelihood in the parameter directions, but by the curvature along the
principal axis of the Hessian matrix with the smallest eigenvalue. Figure 2 shows how various strategies
fare with a simultaneous estimation of the amplitude of clustering $A$ and the redshift distortion parameter
$\beta$, in a simulation of the PSCz galaxy redshift survey. The top left panel shows the likelihood surface
for the full set of 508 modes considered for this analysis (many more are used in the analysis of the real
survey). The modes used, and indeed the parameters involved, are not important for the arguments
here. We see that the parameter estimates are highly correlated. The second panel, top right, shows the
single-parameter optimisation of the first part of this paper. The modes are optimised for $\beta$, and only
the best 320 modes are used. We see that the conditional error in the $\beta$ direction is not much worse
than the full set, but the likelihood declines slowly along the ridge, and the marginal errors on both $\beta$
and $A$ have increased substantially. In the panel bottom left, the SVD procedure has been applied to
the union of modes optimised for $\beta$ and $A$, keeping the best 320 modes. The procedure does reasonably
well, but in this case the error along the ridge has increased. The bottom right graph shows the result
of diagonalizing the Fisher matrix and optimising for the eigenvalue along the ridge. We see excellent
behaviour for the best 320 modes, with almost no loss of information compared with the full set. This
illustrative example shows how data compression may be achieved with good results by application of a
combination of rigorous optimisation and a helping of common sense.
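
A minimal sketch of the ridge strategy: diagonalise the Fisher matrix, take the eigenvector with the smallest eigenvalue as the ridge direction, and form the derivative of the covariance matrix along that direction. The function names and the toy matrices are hypothetical, and the sketch assumes the Gaussian approximation to the likelihood surface holds near the peak.

```python
import numpy as np

def ridge_direction(F):
    """Smallest eigenvalue of F and its unit eigenvector (the likelihood ridge)."""
    w, V = np.linalg.eigh(F)           # eigenvalues in ascending order
    return w[0], V[:, 0]

def derivative_along(direction, dC_list):
    """dC/dphi for the combined parameter phi = direction . theta."""
    return sum(e * dC for e, dC in zip(direction, dC_list))

# Hypothetical Fisher matrix for two parameters whose likelihood ridge
# lies at 45 degrees to the parameter axes.
F_demo = np.array([[1.0, -0.99], [-0.99, 1.0]])
w_min, e_ridge = ridge_direction(F_demo)

# Placeholder covariance derivatives for the two parameters.
dC_demo = [np.eye(2), np.ones((2, 2))]
dC_ridge = derivative_along(e_ridge, dC_demo)
```

The combined derivative matrix can then be used in place of the single-parameter derivative in the eigenvalue problem of section 3, so that the modes are optimised for the poorly constrained combination of parameters.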



5 Conclusions 



We have shown that single-parameter estimation by likelihood analysis can be made efficient in the sense 
that we can compress the original data set to make parameter estimation tractable, and it is optimal in 
the sense that there is no loss of information about the parameter we wish to estimate. Our eigenmodes 
are generalised versions of the signal-to-noise eigenmodes, and are optimal for parameters entering the 
data covariance matrix in arbitrary ways. 

As with all parameter estimation, this is a model-dependent method, in the sense that we need
to know the covariance matrix of the data and to assume Gaussianity. However, we have not had
to introduce anything more than the standard assumptions of likelihood analysis. The dependence on
the initial choice of parameter values is minimal, and can be reduced further by iteration.

For many-parameter estimation, we have shown the effects of two algorithms for optimisation.
Optimising separately for several parameters by the single-parameter method, and trimming the resulting
dataset via an SVD step, is successful in recovering the conditional likelihood errors. For correlated
parameter estimates, a promising technique appears to be to diagonalize the Fisher matrix and optimise
for the single parameter along the likelihood ridge.